In this tutorial we will discuss DNA methylation data analysis, for data obtained from the Illumina HumanMethylation 450K Bead Array. This array investigates DNA methylation using bisulfite conversion in a quantitative manner at around 480,000 CpG sites in the human genome.
We will process the example data using a taylored pipeline using aspects of different established software packages, in particular the wateRmelon Bioconductor package, developed by our group. Please note, that there are many other options and software packages that offer alternative routes to 450K array data preprocessing.
Our example data comes from the wateRmelon package, feauturing a subset of the 450K probes. The tutorial will feature the following data processing steps:
Preparing the workspace
Uploading the data
Basic filtering
Pfilter
Normalisation
Exclusion of problematic probes
Linear model
T test
Plotting
In a first step we set the working directory to the new tutorial folder
setwd("/home/course/450K")
Now upload the required R packages for this tutorial into your R workspace. These include the 450K preprocessing package wateRmelon and methylumi as well as the graphics packages ggplot2, and qqman.
library(wateRmelon)
library(methylumi)
library(ggplot2)
library(qqman)
In a normal setting you will often read in a finalReport file as produced from Genome Studio. You would then match up the phenotypic and feature data in the pData and fData data frames of the methylumi object - making sure you match by correct sample ID and probe name. Here we will skip this step since we use a prepared methylumi object
DNA_M <- methylumiR("finalreport.txt")
Instead we are loading in the melon example data from the wateRmelon package:
data(melon)
To get a very broad overview of the methylumi object, we use the class command and call the methylumi object
class(melon)
## [1] "MethyLumiSet"
## attr(,"package")
## [1] "methylumi"
melon
##
## Object Information:
## MethyLumiSet (storageMode: lockedEnvironment)
## assayData: 3363 features, 12 samples
## element names: Avg_NBEADS_A, Avg_NBEADS_B, BEAD_STDERR_A, BEAD_STDERR_B, betas, Intensity, methylated, pvals, unmethylated
## protocolData: none
## phenoData
## sampleNames: 6057825008_R01C01 6057825008_R01C02 ...
## 6057825008_R06C02 (12 total)
## varLabels: sampleID label sex
## varMetadata: labelDescription
## featureData
## featureNames: cg00000029 cg00000108 ... rs9839873 (3363 total)
## fvarLabels: TargetID ProbeID_A ... (38 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## Major Operation History:
## submitted finished command
## 1 2012-10-17 14:23:16 2012-10-17 14:23:20 methylumiR(filename = "fr2.txt")
## 2 2012-10-17 17:11:19 2012-10-17 17:11:20 Subset of 46 samples.
## 3 2012-10-17 17:11:48 2012-10-17 17:11:48 Subset of 12 samples.
Display how many samples the methylumi object contains and how many probes are covered:
dim(melon)
## Features Samples
## 3363 12
Let’s get an overview of the three data frames in the methylumi object
head(betas(melon))
## 6057825008_R01C01 6057825008_R01C02 6057825008_R02C01
## cg00000029 0.67233 0.71083 0.67504
## cg00000108 0.57508 0.37251 0.65755
## cg00000109 0.75909 0.77251 0.78121
## cg00000165 0.39772 0.52589 0.30660
## cg00000236 0.70793 0.75873 0.75429
## cg00000289 0.43090 0.44909 0.51544
## 6057825008_R02C02 6057825008_R03C01 6057825008_R03C02
## cg00000029 0.69099 0.75096 0.70100
## cg00000108 0.48510 0.60844 0.57751
## cg00000109 0.78238 0.83259 0.77071
## cg00000165 0.43188 0.36626 0.51246
## cg00000236 0.76494 0.77341 0.76085
## cg00000289 0.50341 0.48614 0.48561
## 6057825008_R04C01 6057825008_R04C02 6057825008_R05C01
## cg00000029 0.69015 0.70555 0.73616
## cg00000108 0.55881 0.58700 0.56206
## cg00000109 0.67976 0.81480 0.79522
## cg00000165 0.43329 0.29387 0.50085
## cg00000236 0.75011 0.70859 0.76726
## cg00000289 0.46985 0.40216 0.44669
## 6057825008_R05C02 6057825008_R06C01 6057825008_R06C02
## cg00000029 0.69577 0.71457 0.71944
## cg00000108 0.51776 0.50770 0.54396
## cg00000109 0.77952 0.70430 0.71377
## cg00000165 0.49857 0.48607 0.36159
## cg00000236 0.71562 0.70947 0.75575
## cg00000289 0.51309 0.55050 0.54047
head(pData(melon))
## sampleID label sex
## 6057825008_R01C01 6057825008_R01C01 6057825008_R01C01 M
## 6057825008_R01C02 6057825008_R01C02 6057825008_R01C02 M
## 6057825008_R02C01 6057825008_R02C01 6057825008_R02C01 M
## 6057825008_R02C02 6057825008_R02C02 6057825008_R02C02 M
## 6057825008_R03C01 6057825008_R03C01 6057825008_R03C01 F
## 6057825008_R03C02 6057825008_R03C02 6057825008_R03C02 F
head(fData(melon))
## TargetID ProbeID_A ProbeID_B ILMNID NAME
## cg00000029 cg00000029 14782418 14782418 cg00000029 cg00000029
## cg00000108 cg00000108 12709357 12709357 cg00000108 cg00000108
## cg00000109 cg00000109 59755374 59755374 cg00000109 cg00000109
## cg00000165 cg00000165 12637463 12637463 cg00000165 cg00000165
## cg00000236 cg00000236 12649348 12649348 cg00000236 cg00000236
## cg00000289 cg00000289 18766346 18766346 cg00000289 cg00000289
## ADDRESSA_ID ALLELEA_PROBESEQ
## cg00000029 14782418 AACTATACTAACRAAAAAATATCCAAAAAACACTAACRTATAAAAATTTC
## cg00000108 12709357 ATACAATAAAACAAACCTAAAATAATCCTAACTCCRCTATCATCCTAACC
## cg00000109 59755374 CAATACTAACAAACACATATACCCCCCCACAAATCTTAACTTCTAAATAC
## cg00000165 12637463 CAAAATCTATTAATACAATAACTTTTAATAAAACAACTAAAACACACATC
## cg00000236 12649348 TATAACRTCATATTAAAAAAAACRATCTAACCCACCAATTTATACATCAC
## cg00000289 18766346 ATCTACTATATTCATTTCTCCAATCTCATATCCATTTTAATATAAAAATC
## ADDRESSB_ID ALLELEB_PROBESEQ INFINIUM_DESIGN_TYPE NEXT_BASE
## cg00000029 NA II
## cg00000108 NA II
## cg00000109 NA II
## cg00000165 NA II
## cg00000236 NA II
## cg00000289 NA II
## COLOR_CHANNEL
## cg00000029
## cg00000108
## cg00000109
## cg00000165
## cg00000236
## cg00000289
## FORWARD_SEQUENCE
## cg00000029 TTTTTTAGATAAGGATATCCAGGCGATGAGGAAGTTTTACTTCTGGGAACAGCCTGGATA[CG]AAACCTTCACACGTCAGTGTCTTTTGGACATTTTCTCGTCAGTACAGCCCTGTTGAATGT
## cg00000108 TCCATTTTGAAGGAAAAAAATGAAGGCTCTGAAAGTGTAAATCGCTTACTGAAGGGCACA[CG]GCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCATATGTCATGGCT
## cg00000109 GCCTTAGTCCTGAATGAGCCATTTCTCTAAGAAGTCCTGGCTTCTTTTTTAATAGAGAAT[CG]TATTTAGAAGCCAAGATCTGTGGGGGGGTACATGTGCCTGTTAGTATTGCAGTTGTGCCT
## cg00000165 CTAAGTGCAGTCAGGATCTGTTAGTACAGTGGCTTTTGATGGAACAGCTGAGGCACACAT[CG]CCCGTGGCATGGACTCCGGGGCCGAACGCTCACGACCAAGACTTTTGCCCTTTTGAAATG
## cg00000236 CTCAGCGACAGTGTAGCGTCATGTTAGAGGAGACGATCTGACCCACCAGTTTGTACATCA[CG]TCCTGCATGTCCCACACCATTTTTTCATGACCTTGTAATATACTGGTCTCTGTGCTATAG
## cg00000289 CAAGTGAGCTAGCAAACACACATGCACCAATGTGCCTTTTGACAAGAGTACCCCCTACCC[CG]ACTCCCACACCAAAATGGACATGAGATTGGAGAAATGAATACAGCAGATGGAACAGATAG
## GENOME_BUILD CHR MAPINFO
## cg00000029 37 16 53468112
## cg00000108 37 3 37459206
## cg00000109 37 3 171916037
## cg00000165 37 1 91194674
## cg00000236 37 8 42263294
## cg00000289 37 14 69341139
## SOURCESEQ
## cg00000029 GCTGTACTGACGAGAAAATGTCCAAAAGACACTGACGTGTGAAGGTTTCG
## cg00000108 CGGCCAGGATGACAGCGGAGCCAGGATCACCCCAGGTCTGTCTCATTGCA
## cg00000109 AATACTAACAGGCACATGTACCCCCCCACAGATCTTGGCTTCTAAATACG
## cg00000165 AGGATCTGTTAGTACAGTGGCTTTTGATGGAACAGCTGAGGCACACATCG
## cg00000236 GTAGCGTCATGTTAGAGGAGACGATCTGACCCACCAGTTTGTACATCACG
## cg00000289 TCTGCTGTATTCATTTCTCCAATCTCATGTCCATTTTGGTGTGGGAGTCG
## CHROMOSOME_36 COORDINATE_36 STRAND PROBE_SNPS PROBE_SNPS_10
## cg00000029 16 52025613 F
## cg00000108 3 37434210 F rs9857774
## cg00000109 3 173398731 F rs9864492
## cg00000165 1 90967262 R
## cg00000236 8 42382451 R
## cg00000289 14 68410892 F
## RANDOM_LOCI METHYL27_LOCI UCSC_REFGENE_NAME
## cg00000029 NA NA RBL2
## cg00000108 NA NA C3orf35;C3orf35
## cg00000109 NA NA FNDC3B;FNDC3B
## cg00000165 NA NA
## cg00000236 NA NA VDAC3;VDAC3
## cg00000289 NA NA ACTN1;ACTN1;ACTN1
## UCSC_REFGENE_ACCESSION UCSC_REFGENE_GROUP
## cg00000029 NM_005611 TSS1500
## cg00000108 NM_178339;NM_178342 Body;3'UTR
## cg00000109 NM_001135095;NM_022763 Body;Body
## cg00000165
## cg00000236 NM_005662;NM_001135694 3'UTR;3'UTR
## cg00000289 NM_001130005;NM_001130004;NM_001102 3'UTR;3'UTR;3'UTR
## UCSC_CPG_ISLANDS_NAME RELATION_TO_UCSC_CPG_ISLAND
## cg00000029 chr16:53468284-53469209 N_Shore
## cg00000108
## cg00000109
## cg00000165 chr1:91190489-91192804 S_Shore
## cg00000236
## cg00000289 chr14:69341427-69341820 N_Shore
## PHANTOM DMR ENHANCER HMM_ISLAND
## cg00000029 NA
## cg00000108 NA
## cg00000109 low-CpG:173398671-173398760 NA
## cg00000165 CDMR TRUE 1:90967262-90967361
## cg00000236 NA
## cg00000289 NA
## REGULATORY_FEATURE_NAME REGULATORY_FEATURE_GROUP DHS Index
## cg00000029 16:53467838-53469685 Promoter_Associated TRUE 1 NA
## cg00000108 NA 2 NA
## cg00000109 NA 3 NA
## cg00000165 NA 4 NA
## cg00000236 NA 5 NA
## cg00000289 NA 6 NA
In a first filtering step we look at the distribution of methylated and unmethylated signals for each sample, to check whether there are any drastic outliers. If there are any worrisome samples you can go back to RnBeads or other software to check methylation at the control probes.
boxplot(methylated(melon))
boxplot(unmethylated(melon))
boxplot(log(methylated(melon)))
boxplot(log(unmethylated(melon)))
We will next look at colour channel bias. This especially important for type 2 probes, where the methylation signal is generated using red and green fluorescence on the same bead. These tend to result in different density profiles, so we will plot them independently.
boxplotColorBias(melon, channel='methy')
boxplotColorBias(melon, channel='unmethy')
plotColorBias1D(melon)
plotColorBias1D(melon, channel="methy")
plotColorBias1D(melon, channel="unmethy")
plotColorBias2D(melon)
Another preliminary step involves the calculation of the bisulfite conversion rate, which is based on the signals from the methylated control probes. We will skip this for the purposes of the tutorial.
Next we would normally generate PCA or MDS plots labelled by gender, to check for sample mismatches. First we restrict our methylation data to the X and Y chromosome, calculate a 2-dimensional MDS model and plot the sample load on the 2 coordinates, labelled by their gender. Note that a mismatch between MDS coordinate load and gender doesn’t always have to imply a sample mismatch. Poor array processing quality can be another reason for such discrepancies.
melon_s <- melon[fData(melon)$CHR %in% c("X","Y"),]
beta_M <- as.data.frame(t(betas(melon_s)))
d <- dist(beta_M)
fit <- cmdscale(d,eig=TRUE, k=2)
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="n")
text(x, y, labels = pData(melon)$sex, cex=.7)
Recently we have also adopted the DNA methylation age estimator to check for sample mismachtches. This model generates a very accurate estimate of the subject’s age based on the DNA methylation at ~300 probes for samples taken from any tissue (+/- 3 years). If any sample shows a very large difference between the estimated DNA methylation age and the age recorded in the phenotypic data, there has probably been a sample mismatch.
The 65 SNP probes on the 450K array can also be useful for sample mismatch checks. If you have a population sample, you wouldn’t expect any two samples to have the exact same genotypes at each of the 65 SNPs. Even more importantly, in studies involving monozygotic twins you can check that the twins have been correctly matched.
melon_snps <- melon[grep("rs", featureNames((melon))),]
heatmap(betas(melon_snps))
Next, we are going to filter out samples and probes based on poor detection p values. The pfilter function in wateRmelon will filter your data by detection p value and minimum bead counts. It will run through 3 filtering steps:
Filter out samples with a detection p-value greater than 0.05
Filter out CpG sites with a beadcount <3 in >= 5% of the samples
Filter out CpG sites with a detection p-value >= 0.05 in >= 1% of the samples
Note that these are the default thresholds, they can be changed in the function call if needed.
melon_pf <- pfilter(melon)
## 0 samples having 1 % of sites with a detection p-value greater than 0.05 were removed
## Samples removed:
## 72 sites were removed as beadcount <3 in 5 % of samples
## 40 sites having 1 % of samples with a detection p-value greater than 0.05 were removed
The next step involves normalising your data across samples. The wateRmelon package offers 15 different normalisation methods and options to caluculate performance metrics for each of them. In our experience, working preliminarily in human blood and brain samples, the dasen method works consistently well and is the officially recommended normalisation method within wateRmelon. Dasen is a quantile normalisation algorithm which normalises type I and type II backgrounds seperately in a first step and then quantile normalises methylated and unmethylated signal intensities separately, calculating normalised betas from the normalised signal intensities.
melon_dasen_pf <- dasen(melon_pf)
Let’s take a look at signal intensities after normalising and basic filtering:
boxplot(log(methylated(melon_dasen_pf)))
boxplot(log(unmethylated(melon_dasen_pf)))
boxplotColorBias(melon_dasen_pf, channel='methy')
boxplotColorBias(melon_dasen_pf, channel='unmethy')
plotColorBias1D(melon_dasen_pf)
plotColorBias1D(melon_dasen_pf, channel="methy")
plotColorBias1D(melon_dasen_pf, channel="unmethy")
plotColorBias2D(melon_dasen_pf)
Before we can start our statistical analyses and association tests, we need to filter out some further problematic or uninformative probes. These include:
65 SNP probes
Control probes
Probes with common SNPs in the single (or multiple) bp extension of the CpG site
Cross-hybridising probes
The first two categories are easily filtered out using the probe names.
melon_final <- melon_dasen_pf[grep("cg", featureNames(melon_dasen_pf)),]
For the latter two categories we use a list of probes that combines information on identified cross-hybridising probes and common variants from the annotation provided by Illumina and the following two publications:
Chen Y, Lemire M, Choufani S, et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203-209. doi:10.4161/epi.23470.
Blair JD, Price EM. Illuminating Potential Technical Artifacts of DNA-Methylation Array Probes. American Journal of Human Genetics. 2012;91(4):760-762. doi:10.1016/j.ajhg.2012.05.028.
Optionally you can also remove probes on sex chromosomes before conducting your analyses:
melon_auto <- melon_dasen_pf[!fData(melon_dasen_pf)$CHR %in% c("X", "Y"),]
Now we can finally start our statistical analyses and association testing. First, let’s create some artificial data. We will make two new variables in the phenotype data of the quality controlled methylumi object: a dichotomous variable called “Group” and a numeric one called “Height”.
pData(melon_final)$Group <- c(rep(1,6), rep(2,6))
set.seed(2016)
pData(melon_final)$Height <- runif(12, min=1.50, max=2)
pData(melon_final)
## sampleID label sex Group Height
## 6057825008_R01C01 6057825008_R01C01 6057825008_R01C01 M 1 1.590082
## 6057825008_R01C02 6057825008_R01C02 6057825008_R01C02 M 1 1.571472
## 6057825008_R02C01 6057825008_R02C01 6057825008_R02C01 M 1 1.920823
## 6057825008_R02C02 6057825008_R02C02 6057825008_R02C02 M 1 1.566787
## 6057825008_R03C01 6057825008_R03C01 6057825008_R03C01 F 1 1.738751
## 6057825008_R03C02 6057825008_R03C02 6057825008_R03C02 F 1 1.560629
## 6057825008_R04C01 6057825008_R04C01 6057825008_R04C01 F 2 1.808316
## 6057825008_R04C02 6057825008_R04C02 6057825008_R04C02 M 2 1.945274
## 6057825008_R05C01 6057825008_R05C01 6057825008_R05C01 F 2 1.501312
## 6057825008_R05C02 6057825008_R05C02 6057825008_R05C02 M 2 1.526729
## 6057825008_R06C01 6057825008_R06C01 6057825008_R06C01 M 2 1.694344
## 6057825008_R06C02 6057825008_R06C02 6057825008_R06C02 F 2 1.636477
To efficiently run a linear model across all ~400,000 probes we will use an apply function. This allows us to operate a function across rows, columns or entries of large data frames or lists. We will first define a linear model function with no further covariates and then run it across all probes. To convert the output into a better format, we will extract the P values and effect sizes for each probe and annotate this with the Illumina annotation for the probe.
lm_melon <- function(row, var){
lm( row ~ var , na.action=na.exclude)
}
lm_height <- apply(betas(melon_final), 1, lm_melon, var=pData(melon_final)$Height)
head(names(lm_height))
## [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
## [6] "cg00000289"
summary(lm_height[[1]])
##
## Call:
## lm(formula = row ~ var, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.035072 -0.015355 0.001656 0.016251 0.032283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.74765 0.08055 9.282 3.13e-06 ***
## var -0.02469 0.04800 -0.514 0.618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02409 on 10 degrees of freedom
## Multiple R-squared: 0.02578, Adjusted R-squared: -0.07164
## F-statistic: 0.2646 on 1 and 10 DF, p-value: 0.6181
summary(lm_height[[1]])$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.74765132 0.08054584 9.2823084 3.132024e-06
## var -0.02469294 0.04800062 -0.5144296 6.181285e-01
extractP <- function(x){summary(x)$coefficients[2,4]}
pvals <- sapply(lm_height, extractP)
extractE <- function(x){summary(x)$coefficients[2,1]}
evals <- sapply(lm_height, extractE)
results_height <- data.frame(names(lm_height), pvals, evals)
names(results_height) <- c("ILMNID", "P","Coefficient")
results_height <- merge(results_height, fData(melon_final), by="ILMNID", all=FALSE)
Let’s take a first look at our results. We are interested in the low P values, so we will order our results data frame by P values:
results_height[1:5, c(1:6, 17,18, 27)]
## ILMNID P Coefficient TargetID ProbeID_A ProbeID_B CHR
## 1 cg00000029 0.618128459 -0.02469294 cg00000029 14782418 14782418 16
## 2 cg00000108 0.037683714 0.28034252 cg00000108 12709357 12709357 3
## 3 cg00000109 0.850348163 -0.01660486 cg00000109 59755374 59755374 3
## 4 cg00000165 0.002422221 -0.40309610 cg00000165 12637463 12637463 1
## 5 cg00000236 0.744522360 -0.02287686 cg00000236 12649348 12649348 8
## MAPINFO UCSC_REFGENE_NAME
## 1 53468112 RBL2
## 2 37459206 C3orf35;C3orf35
## 3 171916037 FNDC3B;FNDC3B
## 4 91194674
## 5 42263294 VDAC3;VDAC3
results_height <- results_height[order(results_height$P, decreasing = F),]
results_height[1:5, c(1:6, 17,18, 27)]
## ILMNID P Coefficient TargetID ProbeID_A ProbeID_B CHR
## 1626 cg00071391 0.000237722 -0.20072804 cg00071391 70648423 70648423 4
## 1342 cg00058299 0.000599982 0.17742862 cg00058299 17635368 23744319 1
## 1803 cg00081019 0.001464355 -0.11280767 cg00081019 65682310 10727433 1
## 1493 cg00065385 0.002225494 0.02178082 cg00065385 70695427 70695427 9
## 4 cg00000165 0.002422221 -0.40309610 cg00000165 12637463 12637463 1
## MAPINFO UCSC_REFGENE_NAME
## 1626 126560514
## 1342 113498193 SLC16A1;SLC16A1;SLC16A1
## 1803 243658800 SDCCAG8;AKT3
## 1493 111623395 ACTL7A
## 4 91194674
This was the most basic form of a linear model using no further covariates. However, in general, there are many potential confounders in DNA methylation data, including age, sex, tissue or cell type, birthweight, smoking status, infections, medication, genotype, etc. When you conduct your analyses you will need to take into account confounding. Depending on your epidemiological study question and samples you can include the potential confounders as covariates in your linear model or regress them out of your methylation data and work on residuals for further analyses. Here we will include sex as an additional covariate as an example. We first define a new linear model function, which includes an additional covariate and then go through the same steps as previously:
lm_conf <- function(row, var, conf){
lm( row ~ var + conf, na.action=na.exclude)
}
lm_height_conf <- apply(betas(melon_final), 1, lm_conf, var=pData(melon_final)$Height, conf=pData(melon_final)$sex)
summary(lm_height_conf[[1]])
##
## Call:
## lm(formula = row ~ var + conf, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.037431 -0.016720 0.005168 0.016670 0.032256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.746293 0.083563 8.931 9.09e-06 ***
## var -0.021064 0.050217 -0.419 0.685
## confM -0.008071 0.014758 -0.547 0.598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02498 on 9 degrees of freedom
## Multiple R-squared: 0.05711, Adjusted R-squared: -0.1524
## F-statistic: 0.2726 on 2 and 9 DF, p-value: 0.7675
summary(lm_height_conf[[1]])$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.746293049 0.08356333 8.9308681 9.093583e-06
## var -0.021064304 0.05021718 -0.4194641 6.847121e-01
## confM -0.008070689 0.01475836 -0.5468556 5.977732e-01
extractP_H <- function(x){summary(x)$coefficients[2,4]}
pvals_H <- sapply(lm_height_conf, extractP_H)
extractE_H <- function(x){summary(x)$coefficients[2,1]}
evals_H <- sapply(lm_height_conf, extractE_H)
extractP_S <- function(x){summary(x)$coefficients[3,4]}
pvals_S <- sapply(lm_height_conf, extractP_S)
extractE_S <- function(x){summary(x)$coefficients[3,1]}
evals_S <- sapply(lm_height_conf, extractE_S)
results_height_conf <- data.frame(names(lm_height_conf), pvals_H, evals_H, pvals_S, evals_S)
names(results_height_conf) <- c("ILMNID", "P_Height","Coefficient_Height", "P_Sex","Coefficient_Sex")
results_height_conf <- merge(results_height_conf, fData(melon_final), by="ILMNID", all=FALSE)
Let’s take a look at our results. Again, we are interested in the low P values, so we will order our results data frame by P values, first for the height phenotype, then for the sex covariate.
results_height_conf[1:7, c(1:6, 19,20, 29)]
## ILMNID P_Height Coefficient_Height P_Sex Coefficient_Sex
## 1 cg00000029 0.68471209 -0.02106430 0.5977732 -0.008070689
## 2 cg00000108 0.03858932 0.29233615 0.4715487 -0.026675825
## 3 cg00000109 0.80136239 -0.02323186 0.5895240 0.014739558
## 4 cg00000165 0.00293684 -0.41441283 0.4255705 0.025170290
## 5 cg00000236 0.87962530 -0.01014689 0.1731868 -0.028313564
## 6 cg00000289 0.74462415 -0.03588702 0.6016348 0.016985540
## 7 cg00000292 0.78151916 0.02012270 0.6712631 -0.009077305
## TargetID CHR MAPINFO UCSC_REFGENE_NAME
## 1 cg00000029 16 53468112 RBL2
## 2 cg00000108 3 37459206 C3orf35;C3orf35
## 3 cg00000109 3 171916037 FNDC3B;FNDC3B
## 4 cg00000165 1 91194674
## 5 cg00000236 8 42263294 VDAC3;VDAC3
## 6 cg00000289 14 69341139 ACTN1;ACTN1;ACTN1
## 7 cg00000292 16 28890100 ATP2A1;ATP2A1
results_height_conf <- results_height_conf[order(results_height_conf$P_H, decreasing = F),]
results_height_conf[1:7, c(1:6, 19,20, 29)]
## ILMNID P_Height Coefficient_Height P_Sex
## 1626 cg00071391 0.0005477237 -0.1996946 8.424956e-01
## 483 cg00020474 0.0010710830 -0.1011926 5.869046e-03
## 1249 cg00054197 0.0012167921 -0.1969910 4.168535e-02
## 1342 cg00058299 0.0012613420 0.1764956 8.575781e-01
## 1803 cg00081019 0.0015036273 -0.1072936 1.144805e-01
## 968 cg00040419 0.0021104057 0.1586827 1.607576e-07
## 2037 cg13790727 0.0024921229 -0.1242099 3.835369e-02
## Coefficient_Sex TargetID CHR MAPINFO
## 1626 -0.002298554 cg00071391 4 126560514
## 483 -0.022542164 cg00020474 1 8214963
## 1249 0.029602357 cg00054197 5 171765750
## 1342 0.002075174 cg00058299 1 113498193
## 1803 -0.012264235 cg00081019 1 243658800
## 968 0.157628845 cg00040419 X 13670660
## 2037 0.021333303 cg13790727 20 36148738
## UCSC_REFGENE_NAME
## 1626
## 483
## 1249 SH3PXD2B
## 1342 SLC16A1;SLC16A1;SLC16A1
## 1803 SDCCAG8;AKT3
## 968 TCEANC
## 2037 BLCAP;BLCAP;BLCAP;BLCAP;NNAT;NNAT;BLCAP
results_height_conf <- results_height_conf[order(results_height_conf$P_S, decreasing = F),]
results_height_conf[1:7, c(1:6, 19,20, 29)]
## ILMNID P_Height Coefficient_Height P_Sex Coefficient_Sex
## 278 cg00011891 0.5432393 -0.010486709 9.160899e-15 -0.4567849
## 616 cg00026186 0.8999601 0.002182889 1.752378e-14 -0.4322441
## 899 cg00037117 0.7516120 0.008537798 6.148196e-12 -0.3487708
## 973 cg00040455 0.6093479 0.011812695 6.393278e-12 -0.2961960
## 208 cg00008945 0.8212187 0.004385784 8.969097e-12 -0.2409309
## 716 cg00029931 0.4385955 0.039390554 4.947369e-10 -0.3966460
## 1102 cg00046099 0.6453846 0.030559576 1.099198e-09 -0.4789119
## TargetID CHR MAPINFO UCSC_REFGENE_NAME
## 278 cg00011891 X 153714660 UBL4A
## 616 cg00026186 X 48367230 PORCN;PORCN;PORCN;PORCN;PORCN
## 899 cg00037117 X 44730718
## 973 cg00040455 X 30326676 NR0B1
## 208 cg00008945 X 80377379 HMGN5
## 716 cg00029931 X 100645741 RPL36A
## 1102 cg00046099 X 54556443 GNL3L
An alternative to a linear model, when you are looking at a dichotomous independent variable, is the t test. This tests for group mean difference in DNA methylation between two groups (e.g. cases and controls), taking into account the within- and between-group variance in DNA methylation. It won’t make a difference to using a linear model without covariates in the default settings as the lm will default to a t test when your (only) independent variable is dichotomous. It’s still useful to know how to run a t test, and the t test function allows you to use many more customisable settings, so I have summarised the steps below - in the same structure as the linear model. The first main difference is that you need to provide the t test function with two numeric vectors - DNA methylation in cases and DNA methylation in controls, so you will have to subset your data for this step.
group1 <- pData(melon_final)$Group==1
group2 <- pData(melon_final)$Group==2
tt_melon <- function(row, var){
t.test( row[group1], row[group2], na.action=na.exclude)
}
tt_group <- apply(betas(melon_final), 1, tt_melon)
head(names(tt_group))
## [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
## [6] "cg00000289"
tt_group[[1]]
##
## Welch Two Sample t-test
##
## data: row[group1] and row[group2]
## t = 0.43663, df = 9.9476, p-value = 0.6717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02503175 0.03722247
## sample estimates:
## mean of x mean of y
## 0.7094186 0.7033232
extractP <- function(x){x$p.value}
pvals <- sapply(tt_group, extractP)
extractE <- function(x){x$estimate[1]-x$estimate[2]}
evals <- sapply(tt_group, extractE)
results_group <- data.frame(names(tt_group), pvals, evals)
names(results_group) <- c("ILMNID", "P","Coefficient")
results_group <- merge(results_group, fData(melon_final), by="ILMNID", all=FALSE)
Let’s take a look at our results. We will order our results data frame by P values, as before:
results_group[1:5, c(1:6, 17,18, 27)]
## ILMNID P Coefficient TargetID ProbeID_A ProbeID_B CHR
## 1 cg00000029 0.6717048 0.006095362 cg00000029 14782418 14782418 16
## 2 cg00000108 0.4990577 0.029631721 cg00000108 12709357 12709357 3
## 3 cg00000109 0.4005193 0.021491797 cg00000109 59755374 59755374 3
## 4 cg00000165 0.5249029 0.030351635 cg00000165 12637463 12637463 1
## 5 cg00000236 0.2935844 0.021137782 cg00000236 12649348 12649348 8
## MAPINFO UCSC_REFGENE_NAME
## 1 53468112 RBL2
## 2 37459206 C3orf35;C3orf35
## 3 171916037 FNDC3B;FNDC3B
## 4 91194674
## 5 42263294 VDAC3;VDAC3
results_group <- results_group[order(results_group$P, decreasing = F),]
results_group[1:5, c(1:6, 17,18, 27)]
## ILMNID P Coefficient TargetID ProbeID_A ProbeID_B
## 199 cg00008695 0.0001396302 0.01307807 cg00008695 57668407 57668407
## 1307 cg00056676 0.0003323476 -0.02785579 cg00056676 13769381 13769381
## 820 cg00034039 0.0004086537 -0.02416043 cg00034039 62763343 62763343
## 264 cg00011350 0.0005432959 0.02961043 cg00011350 27734483 73698334
## 201 cg00008737 0.0010179038 0.02717505 cg00008737 70742487 70742487
## CHR MAPINFO UCSC_REFGENE_NAME
## 199 8 960721
## 1307 5 101631668 SLCO4C1
## 820 9 74526266 C9orf85;FAM108B1;FAM108B1
## 264 12 49444296 MLL2
## 201 19 51021200 LRRC4B
In the last section we will look at some basic plotting options to describe the outcomes of our epigenome-wide association studies. To summarise a whole EWAS into a single plot you can basically use the same plot types that are employed in GWAS. We will use the qqman package, which you have seen in the GWAS tutorial on Monday, to plot a Manhattan plot - an overview of p values across the genome - as well as a qq plot showing your empirical p values compared to theoretical p values expected under null hypotheses and normally distributed data.
We first have to reformat some of the variables in our results data frames and can then proceed to plot the Manhattan and qq plots. We will start with our (covariate-free) Height EWAS:
table(results_height$CHR)
##
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4
## 0 271 73 82 103 57 40 72 142 122 33 99 184 98 13 33 97 97
## 5 6 7 8 9 X Y
## 84 166 128 97 28 36 0
class(results_height$CHR)
## [1] "factor"
results_height$CHR <- as.character(results_height$CHR)
results_height$CHR[results_height$CHR=="X"] <- "23"
results_height$CHR[results_height$CHR=="Y"] <- "24"
results_height$CHR <- as.numeric(results_height$CHR)
names(results_height)
## [1] "ILMNID" "P"
## [3] "Coefficient" "TargetID"
## [5] "ProbeID_A" "ProbeID_B"
## [7] "NAME" "ADDRESSA_ID"
## [9] "ALLELEA_PROBESEQ" "ADDRESSB_ID"
## [11] "ALLELEB_PROBESEQ" "INFINIUM_DESIGN_TYPE"
## [13] "NEXT_BASE" "COLOR_CHANNEL"
## [15] "FORWARD_SEQUENCE" "GENOME_BUILD"
## [17] "CHR" "MAPINFO"
## [19] "SOURCESEQ" "CHROMOSOME_36"
## [21] "COORDINATE_36" "STRAND"
## [23] "PROBE_SNPS" "PROBE_SNPS_10"
## [25] "RANDOM_LOCI" "METHYL27_LOCI"
## [27] "UCSC_REFGENE_NAME" "UCSC_REFGENE_ACCESSION"
## [29] "UCSC_REFGENE_GROUP" "UCSC_CPG_ISLANDS_NAME"
## [31] "RELATION_TO_UCSC_CPG_ISLAND" "PHANTOM"
## [33] "DMR" "ENHANCER"
## [35] "HMM_ISLAND" "REGULATORY_FEATURE_NAME"
## [37] "REGULATORY_FEATURE_GROUP" "DHS"
## [39] "Index" "Var.40"
names(results_height)[18] <- "BP"
manhattan(results_height, main = "Manhattan plot for Height EWAS", cex = 0.5, cex.axis = 0.8)
## Warning in manhattan(results_height, main = "Manhattan plot for Height
## EWAS", : No SNP column found. OK unless you're trying to highlight.
qq(results_height$P, main = "Q-Q plot of Height EWAS p-values")
Next, we’ll produce the same plots for our Group EWAS, which also means we’ll have to go through the same reformatting steps:
table(results_group$CHR)
##
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4
## 0 271 73 82 103 57 40 72 142 122 33 99 184 98 13 33 97 97
## 5 6 7 8 9 X Y
## 84 166 128 97 28 36 0
class(results_group$CHR)
## [1] "factor"
results_group$CHR <- as.character(results_group$CHR)
results_group$CHR[results_group$CHR=="X"] <- "23"
results_group$CHR[results_group$CHR=="Y"] <- "24"
results_group$CHR <- as.numeric(results_group$CHR)
names(results_group)
## [1] "ILMNID" "P"
## [3] "Coefficient" "TargetID"
## [5] "ProbeID_A" "ProbeID_B"
## [7] "NAME" "ADDRESSA_ID"
## [9] "ALLELEA_PROBESEQ" "ADDRESSB_ID"
## [11] "ALLELEB_PROBESEQ" "INFINIUM_DESIGN_TYPE"
## [13] "NEXT_BASE" "COLOR_CHANNEL"
## [15] "FORWARD_SEQUENCE" "GENOME_BUILD"
## [17] "CHR" "MAPINFO"
## [19] "SOURCESEQ" "CHROMOSOME_36"
## [21] "COORDINATE_36" "STRAND"
## [23] "PROBE_SNPS" "PROBE_SNPS_10"
## [25] "RANDOM_LOCI" "METHYL27_LOCI"
## [27] "UCSC_REFGENE_NAME" "UCSC_REFGENE_ACCESSION"
## [29] "UCSC_REFGENE_GROUP" "UCSC_CPG_ISLANDS_NAME"
## [31] "RELATION_TO_UCSC_CPG_ISLAND" "PHANTOM"
## [33] "DMR" "ENHANCER"
## [35] "HMM_ISLAND" "REGULATORY_FEATURE_NAME"
## [37] "REGULATORY_FEATURE_GROUP" "DHS"
## [39] "Index" "Var.40"
names(results_group)[18] <- "BP"
manhattan(results_group, main = "Manhattan plot for Group EWAS", cex = 0.5, cex.axis = 0.8)
## Warning in manhattan(results_group, main = "Manhattan plot for Group
## EWAS", : No SNP column found. OK unless you're trying to highlight.
qq(results_group$P, main = "Q-Q plot of Group EWAS p-values")
At the individual probe level, you will usually want to show DNA methylation patterns at your top associated probes using some form of scatter plot. We will start with a simple scatter plot and a regression line for our top hit locus in the Height EWAS. To facilitate this we first make a new data frame which contains DNA methylation data at the top probe as well as phenotypic data on the samples and then generate the scatter plot:
f <- data.frame(betas(melon_final)[featureNames(melon_final) == results_height$ILMNID[1],], pData(melon_final)$Height, pData(melon_final)$sex)
names(f) <- c("Probe", "Height", "Sex")
f$Height <- as.numeric(f$Height)
f$Sex <- as.factor(f$Sex)
ggplot(f, aes(x = Height, y = Probe*100)) + geom_point(pch=21) + ylab("DNA Methylation (%)")
We will next add a regression line. In the default settings this will also plot standard errors around the regression line.
ggplot(f, aes(x = Height, y = Probe*100)) + geom_point(pch=21) + stat_smooth(method = "lm") + ylab("DNA Methylation (%)")
To include even more information in the plot, we can also colour the data points by sample sex. In this case I would recommend to change the regression line colour to black, to limit the number of colours in the plot.
ggplot(f, aes(x = Height, y = Probe*100)) + geom_point(pch=21, aes(colour=f$Sex)) + stat_smooth(method = "lm", colour="black") + ylab("DNA Methylation (%)")
Finally we will make a scatter plot for the top hit in the Group EWAS. This will be slightly different to the Height EWAS, since the Group phenotype is dichotomous. First we will start with a basic scatter plot. Here, we are also colouring the data points by Group, but will suppress the figure legend.
f <- data.frame(betas(melon_final)[featureNames(melon_final)==results_group$ILMNID[1],], pData(melon_final)$Group)
names(f) <- c("Probe", "Group")
f$Group <- as.factor(f$Group)
ggplot(f, aes(x=Group, y=Probe*100)) + theme(legend.position = "none") + geom_point(aes(colour=Group)) + ylab("DNA Methylation (%)")
Next we will introduce a little bit of random variation along the horizontal axis to better visualize the data points and avoid overlapping points.
ggplot(f, aes(x=Group, y=Probe*100)) + theme(legend.position = "none") + geom_point(aes(colour=Group), position=position_jitter(width=0.3, height=0)) + ylab("DNA Methylation (%)")
And lastly we will add a boxplot for each group on top of the scatter plot:
ggplot(f, aes(x=Group, y=Probe*100)) + theme(legend.position = "none") + geom_point(aes(colour=Group), position=position_jitter(width=0.3, height=0)) + geom_boxplot(fill=NA, outlier.colour=NA) + ylab("DNA Methylation (%)")